18/02/2019
Posted on flickr by BBVAtech in 2012, by Asigra CC BY 2.0
Image by Jennifer Dutcher, datascience@berkeley, source: https://datascience.berkeley.edu/what-is-big-data/
"Big Data is the result of collecting information at its most granular level — it’s what you get when you instrument a system and keep all of the data that your instrumentation is able to gather."
Jon Bruner (Editor-at-Large, O’Reilly Media)
"Big data is data that contains enough observations to demand unusual handling because of its sheer size, though what is unusual changes over time and varies from one discipline to another."
Annette Greiner
(Lecturer, UC Berkeley School of Information)
"[…] 'big data' will ultimately describe any dataset large enough to necessitate high-level programming skill and statistically defensible methodologies in order to transform the data asset into something of value."
Reid Bryant
(Data Scientist, Brooks Bell)
'Big Data Landscape (2018)', source: http://mattturck.com
Photo by Joe Parks, (CC BY-NC 2.0) source: https://flic.kr/p/e2umhv
Astronomy: SKA Radio Telescope
Image by SKA Organisation, source: https://www.skatelescope.org/multimedia/image
Astronomy: SKA Radio Telescope
Image by SKA Organisation, source: https://www.skatelescope.org/multimedia/image
Source: Bollen, Mao, and Zeng (2011)
Source: Bollen, Mao, and Zeng (2011)
Source: Ranco (2015)
‐ Understand the concept of Big Data in the context of economic research.
‐ Understand the technical challenges of Big Data Analytics and how to practically deal with them.
‐ Know the basic statistical techniques of clustering, dimensionality reduction, and factor models.
Preparations
# read dataset into R
economics <- read.csv("../data/economics.csv")
# have a look at the data
head(economics, 2)
##         date   pce    pop psavert uempmed unemploy
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
# create a 'large' dataset out of this
for (i in 1:3) {
economics <- rbind(economics, economics)
}
dim(economics)
## [1] 4592 6
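Each pass of the loop stacks the data frame on top of itself, doubling the row count. Starting from the 574 monthly observations in the original dataset, three doublings give 574 × 2³ = 4592 rows, matching the `dim()` output above:

```r
# each rbind(economics, economics) call doubles the number of rows
rows_original <- 574
rows_after <- rows_original * 2^3
rows_after
# → 4592
```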
Compute the real personal consumption expenditures (pce): Divide each value of pce by the deflator 1.05.
# Naïve approach (ignorant of R)
deflator <- 1.05 # define deflator
# iterate through each observation
pce_real <- c()
n_obs <- length(economics$pce)
for (i in 1:n_obs) {
pce_real <- c(pce_real, economics$pce[i]/deflator)
}
# look at the result
head(pce_real, 2)
## [1] 483.2381 486.1905
How long does it take?
# Naïve approach (ignorant of R)
deflator <- 1.05 # define deflator
# iterate through each observation
pce_real <- c()
n_obs <- length(economics$pce)
time_elapsed <-
system.time(
for (i in 1:n_obs) {
pce_real <- c(pce_real, economics$pce[i]/deflator)
})
time_elapsed
##    user  system elapsed 
##   0.163   0.012   0.176
Assuming a linear time algorithm (\(O(n)\)), we need that much time for one additional row of data:
time_per_row <- time_elapsed[3]/n_obs
time_per_row
##      elapsed 
## 3.832753e-05
If we deal with big data, say 100 million rows, that is
# in seconds
time_per_row*100^4

##  elapsed 
## 3832.753

# in minutes
(time_per_row*100^4)/60

##  elapsed 
## 63.87921

# in hours
(time_per_row*100^4)/60^2

##  elapsed 
## 1.064654
What happens in the background?
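The culprit is `c()`: each call allocates a brand-new vector and copies every existing element into it before appending the new one. Growing a length-n vector this way therefore costs on the order of n²/2 copies, so the loop is quadratic even though the arithmetic itself is linear. A minimal sketch of the pattern (the sizes are arbitrary):

```r
# Growing a vector with c() re-allocates and copies all existing
# elements in every iteration, so total work is roughly
# 1 + 2 + ... + n = O(n^2).
grow <- function(n) {
  x <- c()
  for (i in 1:n) {
    x <- c(x, i)  # copies the whole of x each time
  }
  x
}

t_small <- system.time(grow(10000))[["elapsed"]]
t_large <- system.time(grow(20000))[["elapsed"]]
# doubling n roughly quadruples the elapsed time (machine-dependent)
```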
Can we improve this?
# Improve memory allocation (still somewhat ignorant of R)
deflator <- 1.05 # define deflator
n_obs <- length(economics$pce)
pce_real <- list()
# allocate memory beforehand
# tell R how long the list will be
length(pce_real) <- n_obs
# iterate through each observation
time_elapsed <-
system.time(
for (i in 1:n_obs) {
pce_real[[i]] <- economics$pce[i]/deflator
})
time_elapsed
##    user  system elapsed 
##   0.028   0.001   0.028
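The slide above preallocates a list; preallocating a numeric vector directly with `numeric(n)` is an equally valid (and arguably more idiomatic) variant of the same idea, since the result is numeric anyway. A sketch, using simulated data in place of `economics$pce`:

```r
deflator <- 1.05
pce <- runif(4592, min = 400, max = 600)  # simulated stand-in for economics$pce
n_obs <- length(pce)

# preallocate a numeric vector of the final length, then fill it in place:
# no re-allocation or copying happens inside the loop
pce_real <- numeric(n_obs)
for (i in 1:n_obs) {
  pce_real[i] <- pce[i] / deflator
}
```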
Any improvements?
time_per_row <- time_elapsed[3]/n_obs
time_per_row

##      elapsed 
## 6.097561e-06

# in seconds
time_per_row*100^4

##  elapsed 
## 609.7561

# in minutes
(time_per_row*100^4)/60

## elapsed 
## 10.1626

# in hours
(time_per_row*100^4)/60^2

##   elapsed 
## 0.1693767
This looks much better, but we can do even better…
Can we improve this?
# Do it 'the R way'
deflator <- 1.05 # define deflator
# Exploit R's vectorization!
time_elapsed <-
system.time(
pce_real <- economics$pce/deflator
)
# same result
head(pce_real, 2)
## [1] 483.2381 486.1905
# but much faster!
time_elapsed

##    user  system elapsed 
##       0       0       0
time_per_row <- time_elapsed[3]/n_obs
In fact, system.time() is not precise enough to capture the time elapsed…
# in seconds
time_per_row*100^4

## elapsed 
##       0

# in minutes
(time_per_row*100^4)/60

## elapsed 
##       0

# in hours
(time_per_row*100^4)/60^2

## elapsed 
##       0
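One common workaround, not shown in the slides, is to repeat the fast operation many times inside `system.time()` and divide the total elapsed time by the number of repetitions; dedicated packages such as microbenchmark automate this. A base-R sketch, using simulated data in place of `economics$pce`:

```r
# system.time() resolution is too coarse for a single vectorized division,
# so we repeat the operation and average.
deflator <- 1.05
pce <- runif(4592, min = 400, max = 600)  # simulated stand-in for economics$pce

reps <- 1000
t_total <- system.time(
  for (r in 1:reps) {
    pce_real <- pce / deflator
  }
)[["elapsed"]]

t_single <- t_total / reps  # average time of one division over the full vector
```

Note that even the averaged estimate includes the small overhead of the `for` loop itself, so it is an upper bound on the time of the vectorized division alone.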
Walkowiak, Simon. 2016. Big Data Analytics with R. Birmingham, UK: Packt Publishing.
Bollen, Johan, Huina Mao, and Xiaojun Zeng. 2011. “Twitter Mood Predicts the Stock Market.” Journal of Computational Science 2 (1): 1–8. doi:10.1016/j.jocs.2010.12.007.
Ranco, Gabriele, Darko Aleksovski, Guido Caldarelli, Miha Grčar, and Igor Mozetič. 2015. “The Effects of Twitter Sentiment on Stock Price Returns.” PLOS ONE 10 (9): 1–21. doi:10.1371/journal.pone.0138441.